Experiments with multi-label text classifier on the Reuters collection

نویسندگان

  • Domonkos Tikk
  • György Biró
چکیده

Text categorization is the classification to assign a text document to an appropriate category in a predefined set of categories. We present an approach on hierarchical text categorization that is a recently emerged subfield of the main topic. Here, documents are assigned to leaf-level categories of a category tree (called taxonomy). The algorithm applies an iterative learning module that allow of gradually creating a classifier by weight adjusting method. We experimented on the well-known Reuters-21578 document corpus with different taxonomies. Results show that our approach outperforms existing ones by up to 10%.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Exploiting Associations between Class Labels in Multi-label Classification

Multi-label classification has many applications in the text categorization, biology and medical diagnosis, in which multiple class labels can be assigned to each training instance simultaneously. As it is often the case that there are relationships between the labels, extracting the existing relationships between the labels and taking advantage of them during the training or prediction phases ...

متن کامل

Discretizing Continuous Attributes in AdaBoost for Text Categorization

We focus on two recently proposed algorithms in the family of “boosting”-based learners for automated text classification, AdaBoost.MH and AdaBoost.MH. While the former is a realization of the well-known AdaBoost algorithm specifically aimed at multi-label text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Bo...

متن کامل

MLIFT: Enhancing Multi-label Classifier with Ensemble Feature Selection

Multi-label classification has gained significant attention during recent years, due to the increasing number of modern applications associated with multi-label data. Despite its short life, different approaches have been presented to solve the task of multi-label classification. LIFT is a multi-label classifier which utilizes a new strategy to multi-label learning by leveraging label-specific ...

متن کامل

Text categorization on Reuters corpus

1 Task The task of text categorization can be described as follows: given a set of documents, we want to assign to each document one or more text categories or no category. In this term project, we want categorize documents from the well-known Reuters-21578 corpus which is a collection of 21578 articles published on Reuters in 1987. We have chosen only three most frequent text categories as the...

متن کامل

A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the fast increase of the documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presented a hybrid model of Invasive Weed Optimization (IWO) and Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the big size of features space in TDC. TDC includes different actions such as text processing, feature extraction, form...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003